R-exercise 5

Loading and exploring the data

library(readr)
human <- read.csv("~/Documents/GitHub/IODS-project/data/human.csv", row.names=1)
str(human)
## 'data.frame':    155 obs. of  9 variables:
##  $ country     : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 27 102 ...
##  $ edu2F       : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ labF        : num  61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
##  $ lifeExp     : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ expSchool   : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ gni         : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ mamo.ratio  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adobirthrate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parlrep     : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##         country        edu2F             labF          lifeExp     
##  Afghanistan:  1   Min.   :  0.90   Min.   :13.50   Min.   :49.00  
##  Albania    :  1   1st Qu.: 27.15   1st Qu.:44.45   1st Qu.:66.30  
##  Algeria    :  1   Median : 56.60   Median :53.50   Median :74.20  
##  Argentina  :  1   Mean   : 55.37   Mean   :52.52   Mean   :71.65  
##  Armenia    :  1   3rd Qu.: 85.15   3rd Qu.:61.90   3rd Qu.:77.25  
##  Australia  :  1   Max.   :100.00   Max.   :88.10   Max.   :83.50  
##  (Other)    :149                                                   
##    expSchool          gni           mamo.ratio      adobirthrate   
##  Min.   : 5.40   Min.   :   581   Min.   :   1.0   Min.   :  0.60  
##  1st Qu.:11.25   1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65  
##  Median :13.50   Median : 12040   Median :  49.0   Median : 33.60  
##  Mean   :13.18   Mean   : 17628   Mean   : 149.1   Mean   : 47.16  
##  3rd Qu.:15.20   3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95  
##  Max.   :20.20   Max.   :123124   Max.   :1100.0   Max.   :204.80  
##                                                                    
##     parlrep     
##  Min.   : 0.00  
##  1st Qu.:12.40  
##  Median :19.30  
##  Mean   :20.91  
##  3rd Qu.:27.95  
##  Max.   :57.50  
## 

human dataframe has 155 observations of 9 variables:

  • country - country name
  • edu2F - proportion of females with at least secondary education
  • labF - proportion of females in the labour force
  • lifeExp - life expectancy at birth
  • expSchool - expected years of schooling
  • gni - gross national income per capita
  • mamo.ratio - maternal mortality ratio
  • parlrep - percetange of female representatives in parliament
  • adobirthrate - adolescent birth rate.

From which edu2F and labF are proportional combined variables.

The database is a combined database of Human Development index and Gender Inequality index, which are a part of United Nations Development Programme’s Human Development reports.

More information about the dataset can be found here.

Technical notes can be found here.

This time there is no binary variable in the data. We can see that for some variables the mean and median are not the same number, so the distributions for them seem skewed. We will see them better next when we’ll plot the variables.

Plotting the data

Let’s start with histograms and density plots, followed by pairs and correlation graphs.

library(purrr)
library(tidyr)
library(ggplot2)

human %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

human %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density()

Some of the variables like expSchool, labF and parlrep look like they might be normally distributed. edu2F might be trimodal. adobirthrate, gni and mamo.ratio are skewed to the right, and lifeExp and expSchool to the left.

Next, I’ll create a categorical variable based on the quartiles of the GNI index, to see if better how the variables differ by their GNI score.

library(GGally)

# create a categorical variable 'gni'
bins <- quantile(human$gni)
gnicat <- cut(human$gni, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))

ggpairs(dplyr::select(human, -country), mapping = aes(col = gnicat, alpha = 0.3), lower = list(combo = wrap("facethist")))

The order of the coloring is red, green, blue, purple from low to high scores in the GNI index.

library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix<-cor(dplyr::select(human, -country)) %>% round(digits=2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)

Countries with higher GNI index score have higher education shares for women, than the lower ones. They also have higher expected years of schooling and life expectancy. Labor participation rate for women is on the higher end in the quartile that scores lowest in GNI. There’s surprisingly little difference between the four GNI quartiles when it comes to women’s parliamental participation rate (there is a presumption that these countries have parliaments). Only the fourth quartile has something of a normal distribution in it and the others are skewed to the right.

Let’s sum up the biggest correlations between the variables (arranged into a decreasing order in absolute value)

library(reshape2)
CM <- cor_matrix                           # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA       # lower tri and diag set to NA
corr_list<-subset(melt(CM, na.rm = TRUE), abs(value) > .7)
corr_list %>% arrange(desc(abs(value))) -> corr_list2
print(corr_list2)
##         Var1         Var2 value
## 1    lifeExp   mamo.ratio -0.86
## 2    lifeExp    expSchool  0.79
## 3      edu2F    expSchool  0.78
## 4 mamo.ratio adobirthrate  0.76
## 5  expSchool   mamo.ratio -0.74
## 6    lifeExp adobirthrate -0.73
## 7      edu2F      lifeExp  0.72
## 8      edu2F   mamo.ratio -0.72

Life expectancy had strong negative correlation (-0.86) with maternal mortality ratio. This one is pretty self-explanatory.

Life expectancy had a strong positive correlation with the expected years of schooling (0.79), and it is logical as when life expectancy is higher, parents will value schooling more as an investment.

Proportion of females with at least secondary education correlated positively (0.78) with expected years of schooling (of course as higher portion of population attend schooling, also the population average will rise).

Next one is positive correlation (0.76) between maternal mortality ratio and adolescent birth rate, which again makes sense, as higher the adolescent birth rate is (how many childbirths women give during their lives), naturally more mortality there is among these women.

As a last remark, education share for women seems to have a positive relationship with higher life expectancy and smaller maternal mortality ratio.

2 Principal component analysis (PCA)

First we perform the PCA to non-standardised data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(dplyr::select(human, -country))

print(pca_human)
## Standard deviations (1, .., p=8):
## [1] 18544.172057   186.283543    25.972416    20.074641    14.321634
## [6]    10.634338     3.721257     1.428190
## 
## Rotation (n x k) = (8 x 8):
##                        PC1          PC2           PC3           PC4
## edu2F        -9.317458e-04  0.085169068 -0.3652346210 -0.8435797499
## labF          9.124960e-05 -0.031442122 -0.0180958993 -0.3928157700
## lifeExp      -2.815825e-04  0.028224407 -0.0170864406 -0.0212996883
## expSchool    -9.562916e-05  0.007556395 -0.0210421918 -0.0308785262
## gni          -9.999828e-01 -0.005830843  0.0006412388  0.0002381021
## mamo.ratio    5.655739e-03 -0.987500960 -0.1481355205 -0.0173448186
## adobirthrate  1.233962e-03 -0.125305410  0.9184673154 -0.3475453954
## parlrep      -5.526462e-05  0.003211742 -0.0038388487 -0.1075784134
##                        PC5           PC6           PC7           PC8
## edu2F        -3.770012e-01 -6.083913e-02  0.0303039622  3.118655e-02
## labF          8.233860e-01  4.077784e-01 -0.0093667016  6.654689e-03
## lifeExp       1.136966e-02 -6.669649e-02 -0.9901494274  1.161211e-01
## expSchool     3.896982e-03 -2.437473e-02 -0.1134676761 -9.925031e-01
## gni           6.651486e-06  2.062449e-05  0.0001028639  1.871381e-05
## mamo.ratio   -3.955532e-02 -2.010447e-02 -0.0244043639 -7.101321e-04
## adobirthrate -1.381061e-01 -2.499459e-02 -0.0127951595 -8.079546e-03
## parlrep       3.989026e-01 -9.077136e-01  0.0704544742  1.925677e-02
# draw a biplot of the principal component representation and the original variables

s <- summary(pca_human)
print(s)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6   PC7
## Standard deviation     1.854e+04 186.2835 25.97 20.07 14.32 10.63 3.721
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00  0.00  0.00 0.000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00  1.00  1.00 1.000
##                          PC8
## Standard deviation     1.428
## Proportion of Variance 0.000
## Cumulative Proportion  1.000
# rounded percetanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

Lets check out the percentages of variance of principal components and make the biplot.

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot1<-biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Seems like without standardising the data one principal component explains all of the variation, and the graph does not tell us much.

3 PCA with standardised data

Let’s now standardise the human data.

human_std <- scale(dplyr::select(human, -country))
# summaries of the scaled variables
summary(human_std)
##      edu2F               labF             lifeExp          expSchool      
##  Min.   :-1.76122   Min.   :-2.43186   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.91250   1st Qu.:-0.50289   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.03967   Median : 0.06116   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.96275   3rd Qu.: 0.58469   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 1.44288   Max.   : 2.21762   Max.   : 1.4218   Max.   : 2.4730  
##       gni            mamo.ratio       adobirthrate        parlrep       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# change the object to data frame
human_std <- as.data.frame(human_std)

Then perform the PCA (principal component analysis) for the standardised data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

print(pca_human)
## Standard deviations (1, .., p=8):
## [1] 2.1359720 1.1170602 0.8767066 0.7202736 0.5530904 0.5299143 0.4493522
## [8] 0.3372776
## 
## Rotation (n x k) = (8 x 8):
##                      PC1         PC2          PC3         PC4         PC5
## edu2F        -0.39874523 0.101737848 -0.210481695 -0.31033900  0.43437742
## labF          0.14439668 0.682158952 -0.575062704 -0.27300332 -0.22413576
## lifeExp      -0.43080653 0.003997077  0.073302641 -0.02783772  0.04920168
## expSchool    -0.41858363 0.138149663 -0.073869337 -0.07719277  0.30831448
## gni          -0.33914119 0.098797891 -0.338769060  0.84020987 -0.01249472
## mamo.ratio    0.41991628 0.123208094 -0.145471957  0.28520785  0.05493343
## adobirthrate  0.40271826 0.088902996 -0.003213286  0.11830579  0.81248359
## parlrep      -0.07626861 0.687286162  0.691544283  0.14537131 -0.01724555
##                      PC6         PC7         PC8
## edu2F        -0.50478735  0.48033587  0.12578655
## labF          0.23657515  0.01076780 -0.04754207
## lifeExp       0.57970641  0.04122211  0.68415054
## expSchool    -0.11017948 -0.81713841 -0.13919229
## gni           0.01564322  0.16310711 -0.16583233
## mamo.ratio   -0.43780233 -0.24286926  0.67254029
## adobirthrate  0.36928738  0.07464930 -0.11761127
## parlrep      -0.11284562  0.09265604 -0.02894539
# draw a biplot of the principal component representation and the original variables

s <- summary(pca_human)
print(s)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6
## Standard deviation     2.1360 1.1171 0.87671 0.72027 0.55309 0.5299
## Proportion of Variance 0.5703 0.1560 0.09608 0.06485 0.03824 0.0351
## Cumulative Proportion  0.5703 0.7263 0.82235 0.88720 0.92544 0.9605
##                            PC7     PC8
## Standard deviation     0.44935 0.33728
## Proportion of Variance 0.02524 0.01422
## Cumulative Proportion  0.98578 1.00000
# rounded percetanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

Lets check out the percentages of variance captured by the principal components and make a biplot.

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 57.0 15.6  9.6  6.5  3.8  3.5  2.5  1.4
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot2<-biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
*Figure A: Women's participation in society and childbearing*

Figure A: Women’s participation in society and childbearing

The results are now different. This is because PCA is sensitive to the relative scaling of the original features, thus standardising the data is a good idea.

4 The personal interpretation of the two PCAs

My interpretation of PC#1 is that it describes the relation between education and birthrate/maternal mortality, i.e. when families have less children who live longer, the education will be seen as more and more important. General education is also likely to improve women’s survival of childbirth.

My interpretation of PC#2 is that it describes the women’s participation in the society, i.e. for example representation in the parliament or participation in the labour markets. The principal component seems to be an aggregate of our two variables parlrep and labF and there does not seem to be a variable that has the opposite effect. I think this is because the variables are shares of multiple variables, which we had in the data wrangling section before combining them into this one.

For example in top left corner we have the Scandinavian countries with high schooling and high participation ration for women, and at the bottom we can find countries where women are known to be repressed in the state politics, like Iran, Iraq and Pakistan. These countries are still relatively advanced, as we can see from maternal mortality rate. Some African countries on the right have quite high participation ratios for women, but still maternal mortality rate is quite high.

5 Tea drinkers of the world unite

For the data tea, 300 tea consumers were interviewed about their consumption of tea. The questions were about how they consume tea, how they think of tea and descriptive questions (sex, age, socio-professional category and sport practise). 122 male and 178 female participated, from age 15 to 90, median age being 32 years and mean 37. We have categorical, qualitative and quantitative variables in the data. For the age, the data set has two different variables: a continuous and a categorical one. The data is part of FactoMineR package for R, and more information can be found from the FactoMineR homepage.

library(FactoMineR)
data(tea)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "age_Q")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                                                                     
##                   where       age_Q   
##  chain store         :192   15-24:92  
##  chain store+tea shop: 78   25-34:69  
##  tea shop            : 30   35-44:40  
##                             45-59:61  
##                             +60  :38
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ age_Q: Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...

We have selected columns Tea, How, how, sugar, where and age_Q from the data.

After this selection, the dataset tea has 300 observations of 6 variables of type factor.

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

From the bar plot we can read that most of the people use tea bags and users of unpackaged tea are in minority. It’s likely that tea is bought in bags from a chain store, as the proportions are similar. The tea is usually drank without supplements, with only sugar added. Around half of the drinkers do not use sugar at all. Majority of the tea quality is Earl Grey tea, which is consumed about 2.5 times over the regular black tea.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.313   0.266   0.249   0.205   0.184   0.175
## % of var.             13.434  11.404  10.689   8.791   7.883   7.511
## Cumulative % of var.  13.434  24.838  35.527  44.318  52.201  59.712
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.168   0.147   0.141   0.135   0.119   0.101
## % of var.              7.218   6.302   6.051   5.787   5.086   4.348
## Cumulative % of var.  66.930  73.231  79.282  85.069  90.155  94.503
##                       Dim.13  Dim.14
## Variance               0.072   0.056
## % of var.              3.094   2.403
## Cumulative % of var.  97.597 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.068  0.005  0.002 |  0.048  0.003  0.001 | -0.471
## 2                  |  0.142  0.021  0.009 |  0.089  0.010  0.004 | -1.069
## 3                  | -0.199  0.042  0.033 | -0.254  0.081  0.053 | -0.600
## 4                  | -0.798  0.678  0.665 | -0.240  0.072  0.060 |  0.064
## 5                  | -0.199  0.042  0.033 | -0.254  0.081  0.053 | -0.600
## 6                  | -0.582  0.360  0.361 | -0.167  0.035  0.030 | -0.235
## 7                  | -0.190  0.039  0.022 | -0.036  0.002  0.001 | -0.411
## 8                  |  0.150  0.024  0.009 |  0.307  0.118  0.036 | -0.880
## 9                  |  0.268  0.076  0.026 |  0.900  1.016  0.290 |  0.175
## 10                 |  0.605  0.390  0.137 |  0.870  0.948  0.282 | -0.073
##                       ctr   cos2  
## 1                   0.296  0.106 |
## 2                   1.527  0.527 |
## 3                   0.481  0.297 |
## 4                   0.005  0.004 |
## 5                   0.481  0.297 |
## 6                   0.074  0.059 |
## 7                   0.226  0.103 |
## 8                   1.035  0.298 |
## 9                   0.041  0.011 |
## 10                  0.007  0.002 |
## 
## Categories (the 10 first)
##                       Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2
## black              |  0.750  7.377  0.184  7.421 |  0.467  3.365  0.071
## Earl Grey          | -0.389  5.179  0.273 -9.036 | -0.016  0.010  0.000
## green              |  0.594  2.063  0.044  3.610 | -0.954  6.272  0.113
## alone              | -0.096  0.316  0.017 -2.253 | -0.262  2.803  0.128
## lemon              |  0.483  1.366  0.029  2.938 |  0.202  0.282  0.005
## milk               | -0.091  0.093  0.002 -0.813 |  0.315  1.307  0.026
## other              |  0.938  1.404  0.027  2.853 |  2.736 14.071  0.232
## tea bag            | -0.460  6.376  0.277 -9.096 | -0.154  0.842  0.031
## tea bag+unpackaged |  0.174  0.504  0.014  2.031 |  0.719 10.150  0.236
## unpackaged         |  1.718 18.837  0.403 10.972 | -1.150  9.948  0.180
##                    v.test    Dim.3    ctr   cos2 v.test  
## black               4.618 | -0.740  9.026  0.179 -7.322 |
## Earl Grey          -0.367 |  0.334  4.788  0.201  7.751 |
## green              -5.800 | -0.293  0.629  0.011 -1.778 |
## alone              -6.183 | -0.047  0.098  0.004 -1.118 |
## lemon               1.229 |  1.264 11.746  0.197  7.685 |
## milk                2.811 | -0.379  2.011  0.038 -3.375 |
## other               8.321 | -0.957  1.836  0.028 -2.910 |
## tea bag            -3.046 | -0.436  7.208  0.249 -8.627 |
## tea bag+unpackaged  8.400 |  0.663  9.193  0.200  7.740 |
## unpackaged         -7.346 |  0.330  0.873  0.015  2.107 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.275 0.154 0.216 |
## How                | 0.060 0.295 0.235 |
## how                | 0.484 0.334 0.259 |
## sugar              | 0.132 0.013 0.200 |
## where              | 0.528 0.632 0.233 |
## age_Q              | 0.402 0.169 0.354 |

From the summary we can see that for the variables alone and other the test variable is smaller in absolute value than the critical value 1.96, so we cannot say that their coefficient is significantly different from zero respect to dim#1.

For dim#2 the same goes for variable black.

# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

The distance between variable categories gives a measure of their similarity.

After plotting the MCA factor map, we can clearly see different groups formed:

Sugar and milk is being added more often to bag tea than unpackaged tea, more often to Earl Grey tea than to black or green. Lemon is added mostly to unpackaged tea. “Other” is a curious case, indicating strongly a Dim2.

dimdesc(mca)
## $`Dim 1`
## $`Dim 1`$quali
##               R2      p.value
## where 0.52835508 3.402919e-49
## how   0.48367544 2.339016e-43
## age_Q 0.40178129 7.360334e-32
## Tea   0.27495543 1.837208e-21
## sugar 0.13225028 8.232202e-11
## How   0.05978394 3.854850e-04
## 
## $`Dim 1`$category
##                         Estimate      p.value
## unpackaged            0.69474080 3.340834e-35
## tea shop              0.69465343 1.212375e-32
## black                 0.24172246 7.129309e-15
## No.sugar              0.20372089 8.232202e-11
## +60                   0.38974921 2.728687e-09
## green                 0.15432353 2.710546e-04
## 45-59                 0.11640619 5.815735e-04
## lemon                 0.09776924 3.157531e-03
## other                 0.35244199 4.165452e-03
## 35-44                 0.13205315 4.331606e-03
## tea bag+unpackaged   -0.16991252 4.203783e-02
## alone                -0.22634169 2.399737e-02
## chain store+tea shop -0.09539809 6.112076e-06
## sugar                -0.20372089 8.232202e-11
## Earl Grey            -0.39604598 2.004995e-22
## tea bag              -0.52482827 9.453587e-23
## 15-24                -0.60392196 3.365056e-30
## chain store          -0.59925534 2.883891e-33
## 
## 
## $`Dim 2`
## $`Dim 2`$quali
##              R2      p.value
## where 0.6318637 3.566828e-65
## how   0.3343163 5.692320e-27
## How   0.2947635 2.694094e-22
## Tea   0.1540126 1.635070e-11
## age_Q 0.1689345 3.629091e-11
## 
## $`Dim 2`$category
##                         Estimate      p.value
## chain store+tea shop  0.70929060 7.770361e-47
## tea bag+unpackaged    0.47161067 3.582736e-19
## other                 1.02576718 8.526004e-19
## black                 0.32725469 2.712739e-06
## +60                   0.31029257 3.430996e-06
## 35-44                 0.19254061 1.476821e-03
## tea bag               0.02118156 2.197206e-03
## milk                 -0.22316139 4.765518e-03
## 25-34                -0.33024542 1.272642e-07
## green                -0.40562971 2.539273e-09
## chain store          -0.03682077 2.085882e-09
## alone                -0.52113722 1.777027e-10
## unpackaged           -0.49279223 1.415218e-14
## tea shop             -0.67246984 5.949668e-20
## 
## 
## $`Dim 3`
## $`Dim 3`$quali
##              R2      p.value
## age_Q 0.3536548 5.876798e-27
## how   0.2585093 5.136908e-20
## where 0.2330683 7.697792e-18
## How   0.2348039 4.237186e-17
## Tea   0.2161434 1.968505e-16
## sugar 0.2003056 3.497087e-16
## 
## $`Dim 3`$category
##                         Estimate      p.value
## 25-34                 0.45976930 2.116883e-16
## Earl Grey             0.28300167 3.119648e-16
## tea bag+unpackaged    0.23830348 3.465086e-16
## sugar                 0.22363951 3.497087e-16
## lemon                 0.64614264 5.936838e-16
## chain store+tea shop  0.15382682 3.249970e-11
## tea shop              0.18552512 5.921016e-05
## 15-24                 0.16779074 5.659078e-03
## unpackaged            0.07220445 3.485880e-02
## +60                  -0.15408157 7.466100e-03
## other                -0.46305098 3.461660e-03
## milk                 -0.17423380 6.751923e-04
## black                -0.25323911 1.756827e-14
## 45-59                -0.37793254 5.344405e-15
## No.sugar             -0.22363951 3.497087e-16
## chain store          -0.33935194 6.755202e-19
## tea bag              -0.31050793 2.755334e-20

Earlier summary of the MCA did not give me much information, but with the dimdesc we can see that people belonging to Dim 1 are those who prefer drinkin unpackaged tea, who buy it from a tea shop, drink it black without sugar and sometimes add a slice of lemon or drink green tea. For those estimates in the list that are negative, goes that if the individual has these characteristics, it is less likely for him/her belonging into this group.